Apple NPU acceleration integrated into llama.cpp, using MiniCPM-V 4.0 as an example. #15262

Open · wants to merge 19 commits into master

Conversation

@tc-mb (Contributor) commented Aug 12, 2025

As stated in #14983, I have integrated Apple NPU (ANE) acceleration into llama.cpp.

Using MiniCPM-V 4.0 as an example, I will introduce a simple way to use the ANE, and I hope we can discuss a better approach.

1. Build llama.cpp locally. I added an `ENABLE_ANE` option to control whether the ANE is used:

   ```bash
   cmake -B build -DENABLE_ANE=ON
   cmake --build build --config Release -j 8
   ```

2. Download the ANE model from Hugging Face or ModelScope. If you downloaded the zip file, please unzip it.

3. Use it like mmproj: I added an `--ane` option, whose argument is the path to the downloaded `ane_minicpmv4_vit_f16.mlmodelc` file:

   ```bash
   ./build/bin/llama-mtmd-cli -m {dir_path}/ggml-model-Q4_0.gguf --mmproj {dir_path}/mmproj-model-f16.gguf --ane {dir_path}/ane_minicpmv4_vit_f16.mlmodelc -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image {dir_path}/xx.png -p "Describe the content of the image in detail."
   ```

I tested ANE acceleration on several devices. The benchmark results are as follows:

Mac M2, q4_K_M — prefill time (ms):

| # | image size | MiniCPM-V 4.0 (ANE) | MiniCPM-V 4.0 |
|---|------------|--------------------:|--------------:|
| 1 | 448×448    | 790.26              | 5716.77       |
| 2 | 600×600    | 1894.24             | 17961.35      |
| 3 | 700×700    | 2954.34             | 27866.59      |
| 4 | 800×800    | 2964.44             | 27946.48      |
| 5 | 1024×625   | 2977.56             | 30111.43      |
| 6 | 1024×768   | 2975.98             | 30415.11      |
| 7 | 1280×960   | 4065.79             | 41889.12      |

Mac M4, q4_K_M — prefill time (ms):

| # | image size | MiniCPM-V 4.0 (ANE) | MiniCPM-V 4.0 |
|---|------------|--------------------:|--------------:|
| 1 | 448×448    | 412.57              | 736.57        |
| 2 | 600×600    | 989.44              | 3365.09       |
| 3 | 700×700    | 1564.61             | 4031.90       |
| 4 | 800×800    | 1555.85             | 4124.81       |
| 5 | 1024×625   | 1563.65             | 5405.13       |
| 6 | 1024×768   | 1567.45             | 5169.05       |
| 7 | 1280×960   | 2141.54             | 7544.96       |

A point worth noting: the first time the ANE model is used there is a one-time loading cost, so the first run is slightly slower. After that, as long as the model is not updated, it stays loaded and ready in the system.

@github-actions bot added the examples and python (python script changes) labels on Aug 12, 2025
@ggerganov (Member) left a comment:

Generally looks OK. Need to improve encapsulation of the CoreML code (see comments). Would need a review from @ngxson.

Also:

  • Use "CoreML" instead of "ANE"
  • Would eventually need instructions for generating the CoreML inference code - can add those after the PR is approved

Comment on lines 98 to 100
```cpp
bool ane_embedding(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, float * vec);
bool ane_resampler(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, const float * vit_embedding, float * vec);
```

Member: No need to expose this in the public interface.
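A sketch of the suggested change, assuming the two declarations above are simply removed from the public header and the definitions in clip.cpp are marked static (names taken from the PR's snippet; the final layout may differ):

```cpp
// clip.cpp — internal helpers, no longer declared in clip.h
static bool ane_embedding(struct clip_ctx * ctx, int n_threads,
                          const struct clip_image_f32_batch * imgs, float * vec);
static bool ane_resampler(struct clip_ctx * ctx, int n_threads,
                          const struct clip_image_f32_batch * imgs,
                          const float * vit_embedding, float * vec);
```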

Comment on lines +115 to +117

```cpp
// ANE support functions
void clip_set_ane_model_path(struct clip_ctx * ctx, const char * ane_model_path);
```
Member: We should find a way to avoid this. Maybe we can do something similar to whisper.cpp:

https://github.com/ggml-org/whisper.cpp/blob/f7502dca872866a310fe69d30b163fa87d256319/src/whisper.cpp#L3351-L3373
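For reference, the whisper.cpp code linked above derives the CoreML model path from the main model path instead of taking it through the public API. A hedged sketch of the same pattern for clip — the helper name and the `-encoder.mlmodelc` suffix convention are assumptions, not part of this PR:

```cpp
#include <string>

// Hypothetical helper: derive the CoreML model path from the mmproj path,
// e.g. "mmproj-model-f16.gguf" -> "mmproj-model-f16-encoder.mlmodelc",
// so no clip_set_ane_model_path() setter is needed in the public interface.
static std::string clip_get_coreml_path_encoder(std::string path_mmproj) {
    const auto pos = path_mmproj.rfind(".gguf");
    if (pos != std::string::npos) {
        path_mmproj.replace(pos, std::string::npos, "-encoder.mlmodelc");
    }
    return path_mmproj;
}
```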

```diff
@@ -82,6 +82,7 @@ struct mtmd_context_params {
     enum ggml_log_level verbosity;
     const char * image_marker; // deprecated, use media_marker instead
     const char * media_marker;
+    const char * ane_model_path; // path to ANE model for iOS
```
Member: Instead of the term "ane", use the term "coreml", as it is more correct. CoreML models can run not only on the Apple Neural Engine, but also on the GPU and CPU.

Comment on lines +3845 to +3852

```cpp
static int flag = 0;
static const void* coremlEncoder = NULL;
static std::string cached_model_path = "";

// Check if we need to load a new model
if (flag == 0 || (ane_model_path && cached_model_path != ane_model_path)) {
    if (coremlEncoder) {
```
Member: Avoid this global state. Figure out a way to move this to the clip context.
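One possible shape for that, as a sketch: keep the cached encoder handle and the path it was loaded from as members of clip_ctx, so each context owns its own CoreML state. The member names are assumptions, and loadModel/closeModel are hypothetical stand-ins for whatever the Objective-C wrapper in this PR exposes:

```cpp
struct clip_ctx {
    // ... existing members ...
    const void * coreml_encoder = nullptr; // lazily-loaded CoreML model handle
    std::string  coreml_model_path;        // path the handle was loaded from
};

static const void * clip_get_coreml_encoder(clip_ctx * ctx, const char * path) {
    // (Re)load only when the requested path changes; no global/static state.
    if (!ctx->coreml_encoder || ctx->coreml_model_path != path) {
        if (ctx->coreml_encoder) {
            closeModel(ctx->coreml_encoder); // hypothetical unload wrapper
        }
        ctx->coreml_encoder    = loadModel(path); // hypothetical load wrapper
        ctx->coreml_model_path = path;
    }
    return ctx->coreml_encoder;
}
```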

@ngxson (Collaborator) left a comment:

The overall idea is good. However, I think we should take the time to make sure this can be useful in the long term.

The biggest issue at the moment is that many TODOs are being copied in this PR, which will make refactoring very difficult in the future. We must resolve this problem first.

Regarding UX: if we cannot have the embeddings and resampler all in one CoreML model, I think we should split the release into two repos on Hugging Face or ModelScope, one containing only the ggml implementation and one containing the CoreML model. Having everything in the same place seems very confusing for most users, most of whom won't have time to read this PR.

Comment on lines +3894 to +3896

```cpp
static bool ane_embedding(clip_ctx * ctx, const int n_threads, const clip_image_f32_batch * imgs_c_ptr, float * vec) {
    const clip_image_f32_batch & imgs = *imgs_c_ptr;
    int batch_size = imgs.entries.size();
```
Collaborator: I don't feel quite comfortable duplicating this function, as you're also duplicating many TODOs, which will make cleaning this up extremely difficult in the future.

We should find a way to merge it with an existing function.
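One possible shape for that merge, as a sketch: branch inside the existing encode entry point rather than cloning it. clip_image_batch_encode is the existing clip API; the CLIP_USE_ANE guard and the clip_image_encode_with_ane helper are assumptions for illustration only:

```cpp
// Sketch: one shared entry point instead of a duplicated ANE variant.
bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads,
                             const clip_image_f32_batch * imgs, float * vec) {
#ifdef CLIP_USE_ANE // hypothetical define set by the ENABLE_ANE build option
    if (!ctx->ane_model_path.empty()) {
        // dispatch only the ViT portion to CoreML; pre/post-processing is shared
        return clip_image_encode_with_ane(ctx, n_threads, imgs, vec); // hypothetical
    }
#endif
    // ... existing ggml-only path, unchanged ...
}
```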

Author: I understand your idea; I will try to modify it.

Comment on lines +3878 to +3879

```cpp
float * vit_embedding1 = (float *)malloc(1100*1152*sizeof(float));
float * vit_embedding2 = (float *)malloc(1100*1152*sizeof(float));
```
Collaborator: We should avoid malloc because we had a lot of memory leaks in the old code base of clip.cpp. Use std::vector<float> instead.
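A minimal sketch of the suggested change, using the sizes from the snippet above:

```cpp
#include <vector>

// RAII buffers instead of raw malloc; freed automatically on scope exit.
std::vector<float> vit_embedding1(1100 * 1152);
std::vector<float> vit_embedding2(1100 * 1152);

// Pass .data() wherever a float * is expected, e.g.:
// ane_embedding(ctx, n_threads, &imgs, vit_embedding1.data());
```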

Comment on lines +3881 to +3883
ane_embedding(ctx, n_threads, &imgs, vit_embedding1);
clip_image_encode_ane(vit_embedding1, vit_embedding2, ctx->ane_model_path.c_str());
ane_resampler(ctx, n_threads, &imgs, vit_embedding2, vec);
Collaborator: It seems like only the ViT part is done by the ANE; the rest (embeddings, resampler) is still done by ggml. Any reason why we can't do the rest with the ANE? I think it could be a cleaner approach, as we would then be able to load only the .mlmodelc file and no longer need the mmproj.gguf file.

Collaborator: Also, maybe we should try ggml_custom_4d and inject clip_image_encode_ane as a node on the ggml cgraph. If that works, it will make everything look much cleaner. Do you think it's a valid use case for ggml_custom_4d, @ggerganov?
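A rough sketch of what that injection might look like, assuming ggml's custom-op API (ggml_custom_4d taking a ggml_custom_op_t callback) and the clip_image_encode_ane wrapper from this PR; the tensor shapes and the single-threaded callback are illustrative only, not the actual implementation:

```cpp
// Callback that runs the CoreML ViT as a node in the ggml compute graph.
static void coreml_vit_op(struct ggml_tensor * dst, int ith, int nth, void * userdata) {
    if (ith != 0) {
        return; // the CoreML call is not split across ggml threads
    }
    auto * cctx = (clip_ctx *) userdata;
    const struct ggml_tensor * src = dst->src[0]; // patch embeddings from ggml
    clip_image_encode_ane((float *) src->data, (float *) dst->data,
                          cctx->ane_model_path.c_str());
}

// When building the cgraph, replace the ggml ViT block with one custom node:
struct ggml_tensor * args[] = { embeddings };
struct ggml_tensor * vit_out = ggml_custom_4d(
        ctx0, GGML_TYPE_F32,
        embeddings->ne[0], embeddings->ne[1], embeddings->ne[2], embeddings->ne[3],
        args, 1, coreml_vit_op, 1 /*n_tasks*/, cctx);
```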

Author: @ngxson Yes, only the ViT is currently replaced with the ANE. Because the embedding calculations aren't yet computed correctly on the ANE, I've bypassed the two embedding steps and only replaced the ViT itself. I'm also still trying other methods to see if there's a solution.

Member:

> Also, maybe we should try ggml_custom_4d and inject clip_image_encode_ane as a node on the ggml cgraph. If that works, it will make everything look much cleaner. Do you think it's a valid use case for ggml_custom_4d?

Haven't considered such a use case for ggml_custom_4d. Sounds worth exploring.

@tc-mb (Author) commented Aug 13, 2025

@ggerganov @ngxson Yes, I understand that introducing a new feature requires more time to discuss its design, including its name, structure, and interface definition. All of this takes time, and I have plenty of time to prepare for it. I will follow the discussion and make sure this feature is incorporated into llama.cpp in a proper manner.
